In this Markdown we will look at a one example where we will go through much of the content of the course covered thus far, with particular emphasis on some tools for exploring potential predictor transformations in multiple regression.

Big Example in R: Upper Ozone Concentration in Los Angeles Basin

Ozone is one of the nasty constituents of photochemical smog, and its concentration is the standard indicator of the severity of such smog. If its level is high, a smog alert is called. Ozone is not emitted directly into the atmosphere. Rather, it is a product of chemical reactions that require solar radiation and emissions of primary pollutants from smoke stacks and automobile exhaust. When the ventilation of the atmosphere is low, the chemical reactions bring ozone to high levels. Low ventilation occurs when wind speeds are low and temperature is high because on hot, calm days in the summer the atmosphere cannot cleanse itself. The goal in the analysis of these data is to determine how ozone depends on the other variables.

A brief description of the variables and their abbreviations is given below:

We begin our analysis by examining a scatterplot matrix of the data. It appears that upper ozone concentration is related to several of the predictors, although not linearly. We also see that several of the variables are correlated with one another.

Reading in the Data

First read the data in from OzoneData.csv and load the car library.

require(car)
## Loading required package: car
Ozone = read.csv(file="http://course1.winona.edu/bdeppa/Regression/Data/Ozone.csv")
names(Ozone)
##  [1] "upoz" "day"  "v500" "wind" "hum"  "safb" "inbh" "dagg" "inbt" "vis"
head(Ozone)
##   upoz day v500 wind hum safb inbh dagg inbt vis
## 1    3   3 5710    4  28   40 2693  -25   87 250
## 2    5   4 5700    3  37   45  590  -24  128 100
## 3    5   5 5760    3  51   54 1450   25  139  60
## 4    6   6 5720    4  69   35 1568   15  121  60
## 5    4   7 5790    6  19   45 2631  -33  123 100
## 6    4   8 5790    3  25   55  554  -28  182 250
summary(Ozone)
##       upoz            day              v500           wind       
##  Min.   : 1.00   Min.   :  3.00   Min.   :5320   Min.   : 0.000  
##  1st Qu.: 5.00   1st Qu.: 90.25   1st Qu.:5690   1st Qu.: 3.000  
##  Median :10.00   Median :177.50   Median :5760   Median : 5.000  
##  Mean   :11.78   Mean   :181.73   Mean   :5750   Mean   : 4.891  
##  3rd Qu.:17.00   3rd Qu.:275.75   3rd Qu.:5830   3rd Qu.: 6.000  
##  Max.   :38.00   Max.   :365.00   Max.   :5950   Max.   :21.000  
##       hum             safb            inbh             dagg       
##  Min.   :19.00   Min.   :25.00   Min.   : 111.0   Min.   :-69.00  
##  1st Qu.:47.00   1st Qu.:51.00   1st Qu.: 877.5   1st Qu.: -9.00  
##  Median :64.00   Median :62.00   Median :2112.5   Median : 24.00  
##  Mean   :58.13   Mean   :61.75   Mean   :2572.9   Mean   : 17.37  
##  3rd Qu.:73.00   3rd Qu.:72.00   3rd Qu.:5000.0   3rd Qu.: 44.75  
##  Max.   :93.00   Max.   :93.00   Max.   :5000.0   Max.   :107.00  
##       inbt            vis       
##  Min.   :-25.0   Min.   :  0.0  
##  1st Qu.:107.0   1st Qu.: 70.0  
##  Median :167.5   Median :120.0  
##  Mean   :161.2   Mean   :124.5  
##  3rd Qu.:214.0   3rd Qu.:150.0  
##  Max.   :332.0   Max.   :350.0

The pairs.plus command the Regression.RData directory that sent you at the beginning of the course will create enhanced scatterplot matrix.

pairs.plus(Ozone)

Fit Preliminary Model

We begin by fitting a baseline model using \(Y = Ozone\) as the response and all of the predictors in the original form as the terms in the model. We will examine a model summary of this model and look at a set of four diagnostic plots for assessing the adequacy of the model using the plot(model) command. Note that as there are four plots in this set I have used the command par(mfrow=c(2,2)) to set up a plotting multiple figure region with 2 rows and 2 columns of plots.

lm.oz1 = lm(upoz~.,data=Ozone)
summary(lm.oz1)
## 
## Call:
## lm(formula = upoz ~ ., data = Ozone)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.2407  -2.8832  -0.3353   2.7409  13.3523 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.3792938 29.5045242   0.623  0.53377    
## day         -0.0088490  0.0027199  -3.253  0.00126 ** 
## v500        -0.0051340  0.0053950  -0.952  0.34200    
## wind        -0.0198304  0.1238829  -0.160  0.87292    
## hum          0.0804923  0.0188345   4.274 2.54e-05 ***
## safb         0.2743349  0.0497361   5.516 7.17e-08 ***
## inbh        -0.0002497  0.0002950  -0.846  0.39798    
## dagg        -0.0036968  0.0112925  -0.327  0.74360    
## inbt         0.0292640  0.0136115   2.150  0.03231 *  
## vis         -0.0080742  0.0037565  -2.149  0.03235 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.441 on 320 degrees of freedom
## Multiple R-squared:  0.7011, Adjusted R-squared:  0.6927 
## F-statistic:  83.4 on 9 and 320 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm.oz1)

par(mfrow=c(1,1))

The residual plot suggest both curvature (i.e. the model is inadequate) and the variation appears to be constant. We can formally test these using Tukey’s test for nonadditivity and the Score test. The car library contains functions for performing both of these tests. The command residualPlots constructs a series of residual plots: \((\hat{e_i}\ vs. each\ term)\) and \(\hat{e_i}\ vs.\ \hat{y_i}\). If you supply the tests=T option to this command it will tests the significance of adding a squared term for each predictor and conducts overall Tukey test using the fitted values squared \(\hat{y_i}^2\) as term to the model. The function ncvTest(model) we conduct the overall score test again using the fitted values \(\hat{y_i}\) to model the squared residuals or a custom model using specified terms to model the variance.

Testing for Curvature and Non-constant Variance

# Manually conducting the Tukey Test for Nonadditivity
yhat2 = fitted(lm.oz1)^2
lm.tukey = lm(upoz~.+yhat2,data=Ozone)
summary(lm.tukey)
## 
## Call:
## lm(formula = upoz ~ . + yhat2, data = Ozone)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.124  -2.307  -0.094   2.454  12.713 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.916e+01  2.753e+01  -1.059    0.290    
## day         -1.865e-03  2.625e-03  -0.711    0.478    
## v500         6.296e-03  5.116e-03   1.231    0.219    
## wind        -1.038e-01  1.134e-01  -0.916    0.361    
## hum          1.569e-02  1.894e-02   0.829    0.408    
## safb        -1.318e-02  5.753e-02  -0.229    0.819    
## inbh        -3.147e-04  2.691e-04  -1.169    0.243    
## dagg         5.332e-03  1.036e-02   0.515    0.607    
## inbt        -6.227e-03  1.316e-02  -0.473    0.636    
## vis         -1.850e-04  3.560e-03  -0.052    0.959    
## yhat2        3.936e-02  4.846e-03   8.122 1.02e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.049 on 319 degrees of freedom
## Multiple R-squared:  0.7523, Adjusted R-squared:  0.7446 
## F-statistic:  96.9 on 10 and 319 DF,  p-value: < 2.2e-16
residualPlots(lm.oz1,tests=T)

##            Test stat Pr(>|t|)
## day           -5.899    0.000
## v500           3.861    0.000
## wind           0.581    0.561
## hum           -1.476    0.141
## safb           5.762    0.000
## inbh           0.602    0.548
## dagg          -4.352    0.000
## inbt           5.108    0.000
## vis            1.980    0.049
## Tukey test     8.122    0.000
# NCV plot along with test using ncvTest from car library
ncv.plot(lm.oz1)

# Overall Score Test
ncvTest(lm.oz1) 
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 35.04723    Df = 1     p = 3.218045e-09
# Custom non-constant variance test using specified terms
ncvTest(lm.oz1,~v500+hum+dagg+inbt,data=Ozone)
## Non-constant Variance Score Test 
## Variance formula: ~ v500 + hum + dagg + inbt 
## Chisquare = 45.29404    Df = 4     p = 3.453833e-09

Inverse Response Plot

Despite the fact some the predictors appear to have mildly nonlinear relationships between them, we will examine an inverse fitted value plot \((\hat{y_i}\ vs. y_i)\) using the invResPlot command from the car library. The \(\hat{\lambda}\) value reported is the “optimal” transformation for fitting the relationship exhibited in the inverse fitted value plot.

invResPlot(lm.oz1)

##      lambda      RSS
## 1  0.437994 3976.284
## 2 -1.000000 8034.225
## 3  0.000000 4334.323
## 4  1.000000 4424.844

The optimal power transformation from the inverse fitted value plot is \(\hat{\lambda}=0.44\), which would lead us to consider possibly a square root or cube root transformation.

Box-Cox Transformation of the Response

The function BCtran in my Regression.RData will conduct the Box-Cox transformation procedure for the response. Also the function powerTransform in the car library will do the same. Recall this procedure is finding a power transform for the response \(Y\) which makes it the most normally distributed in the transformed scale. It generally will not agree with the power transform suggested by the inverse fitted value plot.

BCtran(Ozone$upoz)

summary(powerTransform(Ozone$upoz))
## bcPower Transformation to Normality 
##            Est Power Rounded Pwr Wald Lwr bnd Wald Upr Bnd
## Ozone$upoz    0.1785        0.18       0.0331       0.3239
## 
## Likelihood ratio tests about transformation parameters
##                              LRT df       pval
## LR test, lambda = (0)   5.922072  1 0.01495236
## LR test, lambda = (1) 110.044096  1 0.00000000

Fitting Some Models with a Transformed Response

We will now compare a few models with the response transformed \(T(Y)=\sqrt{Y}\), \(T(Y)=Y^{0.25}\), \(T(Y)=Y^{0.33}\), and \(T(Y)=log(Y)\).

oz.sr = lm(upoz^0.5~.,data=Ozone)
oz.fr = lm(upoz^0.25~.,data=Ozone)
oz.cr = lm(upoz^0.333~.,data=Ozone)
oz.log = lm(log(upoz)~.,data=Ozone)
par(mfrow=c(2,2))
plot(oz.sr)

plot(oz.fr)

plot(oz.cr)

plot(oz.log)

par(mfrow=c(1,1))

Of these four potential models the cube root transformation of the response appears to be the best. However all of them could potentially be used. We will now examine residual plots for the cube root response model, test for curvature, and test for nonconstant variance.

residualPlots(oz.cr)

##            Test stat Pr(>|t|)
## day           -7.778    0.000
## v500           1.483    0.139
## wind           0.447    0.656
## hum           -3.010    0.003
## safb           2.131    0.034
## inbh          -1.315    0.189
## dagg          -5.430    0.000
## inbt           1.612    0.108
## vis            2.093    0.037
## Tukey test     3.636    0.000
ncvTest(oz.cr)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.08783124    Df = 1     p = 0.7669526

Several of the tests for curvature suggest our mean function still needs some work, while the nonconstant variance no longer is an issue. We will now use tools for identifying predictor transformations to address the curvature issue.

C+R Plots and CERES Plots

Before using tools to transform the predictors we will use stepwise methods (using the command step) to drop some of the predictors that do not appear to be important via backward elimination. However, we could also consider seeking transformations first and reduce the model.

oz.step = step(oz.cr)
## Start:  AIC=-850.59
## upoz^0.333 ~ day + v500 + wind + hum + safb + inbh + dagg + inbt + 
##     vis
## 
##        Df Sum of Sq    RSS     AIC
## - v500  1   0.00765 23.600 -852.49
## - wind  1   0.00790 23.601 -852.48
## - dagg  1   0.00829 23.601 -852.48
## - inbt  1   0.08898 23.682 -851.35
## <none>              23.593 -850.59
## - vis   1   0.45222 24.045 -846.33
## - inbh  1   0.47372 24.066 -846.03
## - hum   1   1.00407 24.597 -838.84
## - day   1   1.07978 24.672 -837.82
## - safb  1   3.09264 26.685 -811.94
## 
## Step:  AIC=-852.49
## upoz^0.333 ~ day + wind + hum + safb + inbh + dagg + inbt + vis
## 
##        Df Sum of Sq    RSS     AIC
## - wind  1    0.0053 23.606 -854.41
## - dagg  1    0.0093 23.610 -854.36
## - inbt  1    0.0838 23.684 -853.32
## <none>              23.600 -852.49
## - vis   1    0.4462 24.047 -848.30
## - inbh  1    0.5600 24.160 -846.75
## - hum   1    1.0414 24.642 -840.24
## - day   1    1.1086 24.709 -839.34
## - safb  1    3.1665 26.767 -812.94
## 
## Step:  AIC=-854.41
## upoz^0.333 ~ day + hum + safb + inbh + dagg + inbt + vis
## 
##        Df Sum of Sq    RSS     AIC
## - dagg  1    0.0078 23.613 -856.30
## - inbt  1    0.0824 23.688 -855.26
## <none>              23.606 -854.41
## - vis   1    0.4671 24.073 -849.94
## - inbh  1    0.5984 24.204 -848.15
## - hum   1    1.0427 24.648 -842.15
## - day   1    1.1383 24.744 -840.87
## - safb  1    3.1625 26.768 -814.92
## 
## Step:  AIC=-856.3
## upoz^0.333 ~ day + hum + safb + inbh + inbt + vis
## 
##        Df Sum of Sq    RSS     AIC
## - inbt  1    0.0774 23.691 -857.22
## <none>              23.613 -856.30
## - vis   1    0.4639 24.077 -851.88
## - inbh  1    0.6431 24.256 -849.43
## - day   1    1.2160 24.829 -841.73
## - hum   1    1.7328 25.346 -834.93
## - safb  1    4.4251 28.039 -801.62
## 
## Step:  AIC=-857.22
## upoz^0.333 ~ day + hum + safb + inbh + vis
## 
##        Df Sum of Sq    RSS     AIC
## <none>              23.691 -857.22
## - vis   1    0.5079 24.199 -852.22
## - day   1    1.1388 24.830 -843.73
## - hum   1    1.7220 25.413 -836.07
## - inbh  1    2.6694 26.360 -823.99
## - safb  1   20.4115 44.102 -654.15
summary(oz.step)
## 
## Call:
## lm(formula = upoz^0.333 ~ day + hum + safb + inbh + vis, data = Ozone)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71266 -0.17297  0.01515  0.19058  0.72140 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.873e-01  1.087e-01   8.165 7.26e-15 ***
## day         -5.981e-04  1.515e-04  -3.946 9.73e-05 ***
## hum          4.093e-03  8.434e-04   4.853 1.89e-06 ***
## safb         2.221e-02  1.330e-03  16.708  < 2e-16 ***
## inbh        -6.261e-05  1.036e-05  -6.042 4.18e-09 ***
## vis         -5.929e-04  2.250e-04  -2.635  0.00881 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2704 on 324 degrees of freedom
## Multiple R-squared:  0.7331, Adjusted R-squared:  0.7289 
## F-statistic:   178 on 5 and 324 DF,  p-value: < 2.2e-16
anova(oz.step,oz.cr)
## Analysis of Variance Table
## 
## Model 1: upoz^0.333 ~ day + hum + safb + inbh + vis
## Model 2: upoz^0.333 ~ day + v500 + wind + hum + safb + inbh + dagg + inbt + 
##     vis
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    324 23.691                           
## 2    320 23.593  4  0.098155 0.3328 0.8558
residualPlots(oz.step)

##            Test stat Pr(>|t|)
## day           -6.002    0.000
## hum           -2.397    0.017
## safb           2.186    0.030
## inbh          -1.273    0.204
## vis            2.087    0.038
## Tukey test     3.470    0.001

The Big F-test supports the removal of the terms dropped via backward elimination \((p=.8558)\). The residual plots and tests for curvature for the reduced model suggest curvature is an issue for the reduced model.

We will now use C+R plots and CERES plots to help us find terms to address the curvature using only the predictors remaining after backward elimination.

crPlots(oz.step)

ceresPlots(oz.step)

Polynomial Terms Added to Model

We will fit an overly ambitious polynomial model based upon some of the curvature exhibited in the CERES plots.

oz.poly = lm(upoz^.333~poly(day,3)+poly(hum,3)+poly(safb,2)+poly(inbh,3)+poly(vis,2),data=Ozone)
summary(oz.poly)
## 
## Call:
## lm(formula = upoz^0.333 ~ poly(day, 3) + poly(hum, 3) + poly(safb, 
##     2) + poly(inbh, 3) + poly(vis, 2), data = Ozone)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78773 -0.14896  0.00627  0.16219  0.65946 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.15345    0.01364 157.858  < 2e-16 ***
## poly(day, 3)1  -0.84652    0.27964  -3.027 0.002672 ** 
## poly(day, 3)2  -2.04929    0.41292  -4.963 1.14e-06 ***
## poly(day, 3)3   0.52708    0.27846   1.893 0.059289 .  
## poly(hum, 3)1   0.33795    0.33665   1.004 0.316209    
## poly(hum, 3)2  -0.42249    0.26847  -1.574 0.116561    
## poly(hum, 3)3  -0.01096    0.25632  -0.043 0.965911    
## poly(safb, 2)1  4.72119    0.45523  10.371  < 2e-16 ***
## poly(safb, 2)2  0.32885    0.26052   1.262 0.207783    
## poly(inbh, 3)1 -2.22061    0.33321  -6.664 1.18e-10 ***
## poly(inbh, 3)2 -0.54278    0.26977  -2.012 0.045067 *  
## poly(inbh, 3)3  0.65783    0.25604   2.569 0.010652 *  
## poly(vis, 2)1  -1.22464    0.30574  -4.005 7.72e-05 ***
## poly(vis, 2)2   0.95216    0.26349   3.614 0.000351 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2478 on 316 degrees of freedom
## Multiple R-squared:  0.7813, Adjusted R-squared:  0.7723 
## F-statistic: 86.86 on 13 and 316 DF,  p-value: < 2.2e-16

The squared terms for SAFB is not significant so we will drop it.

oz.poly2 = lm(upoz^.333~poly(day,3)+poly(hum,3)+safb+poly(inbh,3)+poly(vis,2),data=Ozone)
summary(oz.poly2)
## 
## Call:
## lm(formula = upoz^0.333 ~ poly(day, 3) + poly(hum, 3) + safb + 
##     poly(inbh, 3) + poly(vis, 2), data = Ozone)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.80618 -0.15861  0.00941  0.17245  0.64571 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.046626   0.108091   9.683  < 2e-16 ***
## poly(day, 3)1  -0.882526   0.278445  -3.169 0.001676 ** 
## poly(day, 3)2  -2.100770   0.411286  -5.108 5.64e-07 ***
## poly(day, 3)3   0.517840   0.278620   1.859 0.064013 .  
## poly(hum, 3)1   0.358555   0.336564   1.065 0.287533    
## poly(hum, 3)2  -0.447914   0.267965  -1.672 0.095603 .  
## poly(hum, 3)3  -0.009907   0.256562  -0.039 0.969223    
## safb            0.017923   0.001736  10.322  < 2e-16 ***
## poly(inbh, 3)1 -2.205219   0.333294  -6.616 1.57e-10 ***
## poly(inbh, 3)2 -0.488572   0.266579  -1.833 0.067778 .  
## poly(inbh, 3)3  0.702464   0.253828   2.767 0.005981 ** 
## poly(vis, 2)1  -1.236083   0.305896  -4.041 6.69e-05 ***
## poly(vis, 2)2   0.977910   0.262944   3.719 0.000236 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.248 on 317 degrees of freedom
## Multiple R-squared:  0.7802, Adjusted R-squared:  0.7719 
## F-statistic: 93.79 on 12 and 317 DF,  p-value: < 2.2e-16

Only the squared humidity terms is significant so we might consider simply squaring humidity and using that single term in our model, i.e. \(U_j=X_7^2=(humidity)^2\).

hum2 = Ozone$hum^2
oz.poly3 = lm(upoz^.333~poly(day,3)+hum2+safb+poly(inbh,3)+poly(vis,2),data=Ozone)
summary(oz.poly3)
## 
## Call:
## lm(formula = upoz^0.333 ~ poly(day, 3) + hum2 + safb + poly(inbh, 
##     3) + poly(vis, 2), data = Ozone)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.81164 -0.15697  0.01347  0.16724  0.66436 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.079e+00  1.149e-01   9.388  < 2e-16 ***
## poly(day, 3)1  -9.111e-01  2.782e-01  -3.275 0.001171 ** 
## poly(day, 3)2  -2.349e+00  3.891e-01  -6.039 4.32e-09 ***
## poly(day, 3)3   5.166e-01  2.789e-01   1.852 0.064923 .  
## hum2            4.848e-06  8.933e-06   0.543 0.587722    
## safb            1.711e-02  1.677e-03  10.204  < 2e-16 ***
## poly(inbh, 3)1 -2.325e+00  3.265e-01  -7.122 7.09e-12 ***
## poly(inbh, 3)2 -5.261e-01  2.626e-01  -2.003 0.046000 *  
## poly(inbh, 3)3  6.315e-01  2.514e-01   2.512 0.012483 *  
## poly(vis, 2)1  -1.287e+00  3.042e-01  -4.230 3.06e-05 ***
## poly(vis, 2)2   9.621e-01  2.631e-01   3.657 0.000298 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2486 on 319 degrees of freedom
## Multiple R-squared:  0.7779, Adjusted R-squared:  0.7709 
## F-statistic: 111.7 on 10 and 319 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(oz.poly3)

par(mfrow=c(1,1))
residualPlots(oz.poly3)

##               Test stat Pr(>|t|)
## poly(day, 3)         NA       NA
## hum2             -1.715    0.087
## safb              1.408    0.160
## poly(inbh, 3)        NA       NA
## poly(vis, 2)         NA       NA
## Tukey test        1.249    0.211
oz.poly.step = step(oz.poly3)
## Start:  AIC=-907.89
## upoz^0.333 ~ poly(day, 3) + hum2 + safb + poly(inbh, 3) + poly(vis, 
##     2)
## 
##                 Df Sum of Sq    RSS     AIC
## - hum2           1    0.0182 19.730 -909.59
## <none>                       19.712 -907.89
## - poly(vis, 2)   2    1.7204 21.433 -884.28
## - poly(inbh, 3)  3    3.5529 23.265 -859.20
## - poly(day, 3)   3    4.5215 24.234 -845.74
## - safb           1    6.4337 26.146 -816.68
## 
## Step:  AIC=-909.59
## upoz^0.333 ~ poly(day, 3) + safb + poly(inbh, 3) + poly(vis, 
##     2)
## 
##                 Df Sum of Sq    RSS     AIC
## <none>                       19.730 -909.59
## - poly(vis, 2)   2    2.1790 21.910 -879.02
## - poly(inbh, 3)  3    3.6498 23.380 -859.58
## - poly(day, 3)   3    5.4419 25.172 -835.20
## - safb           1    6.4993 26.230 -817.63
summary(oz.poly.step)
## 
## Call:
## lm(formula = upoz^0.333 ~ poly(day, 3) + safb + poly(inbh, 3) + 
##     poly(vis, 2), data = Ozone)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.81107 -0.16265  0.01113  0.16413  0.65740 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.106261   0.102909  10.750  < 2e-16 ***
## poly(day, 3)1  -0.903432   0.277495  -3.256 0.001252 ** 
## poly(day, 3)2  -2.446519   0.345127  -7.089 8.69e-12 ***
## poly(day, 3)3   0.487751   0.273491   1.783 0.075464 .  
## safb            0.016957   0.001652  10.267  < 2e-16 ***
## poly(inbh, 3)1 -2.336157   0.325503  -7.177 5.00e-12 ***
## poly(inbh, 3)2 -0.552420   0.257824  -2.143 0.032897 *  
## poly(inbh, 3)3  0.638250   0.250788   2.545 0.011397 *  
## poly(vis, 2)1  -1.343946   0.285044  -4.715 3.62e-06 ***
## poly(vis, 2)2   0.993020   0.256543   3.871 0.000132 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2483 on 320 degrees of freedom
## Multiple R-squared:  0.7777, Adjusted R-squared:  0.7714 
## F-statistic: 124.4 on 9 and 320 DF,  p-value: < 2.2e-16

The residual plots and curvature tests suggest no model violation deficiencies. To further refine our model we again perform stepwise selection, which leads to the humidity squared term being dropped further simplifying our model.

This example does not lead us to the BEST model, but it is adequate.

Remember all models are WRONG but some are useful!